pacman::p_load(sf, tidyverse, matrixStats, ggplot2, ggpubr, ggstatsplot, heatmaply,
GWmodel, SpatialML, tmap, rsample, Metrics, knitr, kableExtra, httr, jsonlite, rvest)
Take Home Exercise 03
3b: Predicting HDB Resale Prices with Geographically Weighted Machine Learning Methods
Background
Housing in Singapore has always been a hot topic, with purchases made for purposes ranging from investment to accommodation. Many factors affect prices, and they are mainly divided into structural and locational factors. Structural factors are variables related to the property itself, such as its size, fittings, and tenure. Locational factors are variables related to the neighbourhood of the property, such as proximity to childcare centres, public transport services and shopping centres.
However, we also need to consider that spatial autocorrelation and spatial heterogeneity exist in geographic data sets such as housing transactions. In the presence of spatial autocorrelation, OLS estimation of predictive housing resale pricing models could lead to biased, inconsistent, or inefficient results (Anselin 1998).
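As a quick illustration of what spatial autocorrelation measures, here is a hand-rolled Moran's I on a toy one-dimensional layout. The values and weights below are synthetic, not the resale data; this is only a sketch of the statistic.

```r
# Moran's I on synthetic data: four locations in a row, rook (adjacent) neighbours
x <- c(1, 2, 3, 4)
w <- matrix(0, 4, 4)
w[cbind(c(1, 2, 2, 3, 3, 4), c(2, 1, 3, 2, 4, 3))] <- 1  # binary adjacency weights

n <- length(x)
d <- x - mean(x)
I <- (n / sum(w)) * sum(w * outer(d, d)) / sum(d^2)
I  # values above E[I] = -1/(n-1) suggest positive spatial autocorrelation
```

A global test along these lines (e.g. via spdep::moran.test()) is a common first check before deciding that plain OLS is inadequate.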
Objectives
Using multiple linear regression and geographically weighted models, I would like to:
- Predict housing prices based on structural and locational factors.
- Incorporate geographical factors to enhance the predictive model.
- Compare the models created.
Data Sources
(saved under ‘data’ folder)
Master Plan 2019 Subzone Boundary (from In-Class Ex01)
Resale flat prices based on registration date from Jan-2017 onwards from data.gov.sg
1 Setting Up
1.1 Loading R Packages
I will be using the following R packages:
- sf package to perform geospatial wrangling tasks
- tidyverse package for reading csv files and dataframe processing tasks
- matrixStats package for matrix processing
- ggplot2, ggpubr and ggstatsplot packages for plotting statistical graphics
- heatmaply package for attribute standardisation
- GWmodel and SpatialML packages for calibrating geographically weighted models
- rsample and Metrics packages for data splitting and model evaluation
- knitr and kableExtra packages for presenting results in table format
- httr, jsonlite and rvest packages for API calls and web scraping
- tmap package for plotting tasks
2 Procuring Dataset
Steps used in this section are largely adapted from the Take-home_Ex3b-HDBDataPrep provided by Prof Kam and other data preprocessing steps by Hao Xian, Wen Yang and Pierre Jean Michel.
The goal of this section is to supplement the resale flat price data from data.gov.sg with other geospatial and locational factors. While my seniors have provided useful links as to where that information can be obtained, I'll cross-verify it with other updated sources where possible.
| Factor | Type | Source |
|---|---|---|
| Proximity to CBD | Location | - |
| Proximity to Eldercare | Location | OneMap API, data.gov.sg |
| Proximity to Foodcourt/Hawker Centres | Location | data.gov.sg |
| Proximity to MRT | Location | LTA Data Mall |
| Proximity to Park | Location | OneMap API, data.gov.sg |
| Proximity to Good Primary School | Location | OneMap API (list from MOE Website) |
| Proximity to Shopping Mall | Location | OpenStreetMap, OneMap API (Wikipedia) |
| Proximity to Supermarket | Location | |
| No. of Kindergartens within 350m | Location | OneMap API, data.gov.sg |
| No. of Childcare Centres within 350m | Location | OneMap API |
| No. of Bus Stop within 350m | Location | LTA Data Mall |
| No. of Primary Schools within 1km | Location | OneMap API (list from MOE Website) |
| Main Upgrading Program | Structural | HDB Website |
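The distance-based factors above reduce to two sf operations: proximity is the distance to the nearest facility, and the "within 350m" factors are counts inside a buffer distance. A minimal sketch with made-up points in projected metres (EPSG:3414); the column names are illustrative, not the ones used later:

```r
library(sf)

# Toy flats and facilities; coordinates are made up, in metres (EPSG:3414)
flats <- st_as_sf(data.frame(id = 1:2, x = c(0, 1000), y = c(0, 0)),
                  coords = c("x", "y"), crs = 3414)
facilities <- st_as_sf(data.frame(x = c(100, 300, 2000), y = c(0, 0, 0)),
                       coords = c("x", "y"), crs = 3414)

# Proximity factor: distance (m) from each flat to its nearest facility
dist_mat <- unclass(st_distance(flats, facilities))  # strip units for simplicity
flats$PROX_FAC <- apply(dist_mat, 1, min)

# Count factor: number of facilities within 350 m of each flat
flats$WITHIN_350M_FAC <- lengths(st_is_within_distance(flats, facilities, dist = 350))
```

The same pattern applies to every row of the table, swapping in the relevant facility layer.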
2.1 Geospatial Data
Before proceeding to get the various locational and structural data, I will first get the geospatial data for the relevant resale flats of interest. This process makes use of the OneMap API, making calls to retrieve latitude and longitude information given a specific address.
To do so, I will need to filter the dataset to be within Jan 2023 and Sep 2024 after loading it in with read_csv() from the readr package (part of tidyverse). Additional pre-processing steps are also done to create the address field and the remaining lease year and month fields.
# Load resale data
resale <- read_csv("data/aspatial/resale.csv") %>%
filter(month >= "2023-01" & month <= "2024-09")
Rows: 192234 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): month, town, flat_type, block, street_name, storey_range, flat_mode...
dbl (3): floor_area_sqm, lease_commence_date, resale_price
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
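The month filter works because zero-padded "YYYY-MM" strings sort lexicographically in chronological order, and the lease fields rely on fixed-width slices of the remaining_lease string, assumed here to look like "61 years 04 months":

```r
library(stringr)

# Zero-padded "YYYY-MM" strings compare correctly as plain strings
stopifnot("2022-12" < "2023-01", "2024-09" >= "2023-01")

# remaining_lease strings such as "61 years 04 months" can be sliced by position
lease <- "61 years 04 months"
yr  <- as.integer(str_sub(lease, 1, 2))   # "61" -> 61
mth <- as.integer(str_sub(lease, 9, 11))  # " 04" -> 4
```

Note that entries without a months part (e.g. "61 years") would produce NA for the month slice and may need separate handling.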
# Preprocess data
resale_tidy <- resale %>%
mutate(address = paste(block,street_name)) %>%
mutate(remaining_lease_yr = as.integer(
str_sub(remaining_lease, 0, 2)))%>%
mutate(remaining_lease_mth = as.integer(
str_sub(remaining_lease, 9, 11)))
A unique list of the addresses is obtained to minimize duplicate API calls to the same address.
# Unique list
add_list <- sort(unique(resale_tidy$address))
# Unique address by separate columns
unique_address <- resale_tidy %>%
distinct(town, street_name, block)
# Save unique address
write_csv(unique_address,"data/aspatial/unique_address.csv")
The function to retrieve the coordinates of a location given a specific address is as follows:
# Function to retrieve coordinates
get_coords <- function(add_list){
# Create a data frame to store all retrieved coordinates
postal_coords <- data.frame()
for (i in add_list){
#print(i)
r <- GET('https://www.onemap.gov.sg/api/common/elastic/search?',
query=list(searchVal=i,
returnGeom='Y',
getAddrDetails='Y'))
data <- fromJSON(rawToChar(r$content))
found <- data$found
res <- data$results
# Create a new data frame for each address
new_row <- data.frame()
# If single result, append
if (found == 1){
postal <- res$POSTAL
lat <- res$LATITUDE
lng <- res$LONGITUDE
new_row <- data.frame(address= i,
postal = postal,
latitude = lat,
longitude = lng)
}
# If multiple results, drop NIL and append top 1
else if (found > 1){
# Remove those with NIL as postal
res_sub <- res[res$POSTAL != "NIL", ]
# Set as NA first if no Postal
if (nrow(res_sub) == 0) {
new_row <- data.frame(address= i,
postal = NA,
latitude = NA,
longitude = NA)
}
else{
top1 <- head(res_sub, n = 1)
postal <- top1$POSTAL
lat <- top1$LATITUDE
lng <- top1$LONGITUDE
new_row <- data.frame(address= i,
postal = postal,
latitude = lat,
longitude = lng)
}
}
else {
new_row <- data.frame(address= i,
postal = NA,
latitude = NA,
longitude = NA)
}
# Add the row
postal_coords <- rbind(postal_coords, new_row)
}
return(postal_coords)
}
After getting the coordinates, they are saved in rds format for easy retrieval.
# Get coordinates of hdb addresses
coords <- get_coords(add_list)
# Save coordinates as rds
write_rds(coords, "data/processed/coords_full.rds")
The coordinates will be joined with the resale_tidy dataset. A quick check is needed to ensure that all coordinates retrieved by this process have valid latitude and longitude. While a couple of addresses have no postal code, I can proceed since they still have latitude and longitude coordinates.
# Load coordinates data
coords <- read_rds("data/processed/coords_full.rds")
# Check coordinates data
coords[(is.na(coords$postal) | is.na(coords$latitude) | is.na(coords$longitude) | coords$postal=="NIL"), ]
     address postal latitude longitude
2129 215 CHOA CHU KANG CTRL NIL 1.38308302434129 103.747076627693
2141 216 CHOA CHU KANG CTRL NIL 1.38308302434129 103.747076627693
# Join coordinates to resale_tidy dataset
resale_sf <- left_join(resale_tidy, coords, by = c('address' = 'address')) %>%
st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
st_transform(crs=3414)
2.2 Locational Data
Using httr and jsonlite, I will first get the full list of available themes in OneMap API. The token can be requested from the OneMap website.
token <- 'your_token_here'
# Url to get Theme Info
url <- "https://www.onemap.gov.sg/api/public/themesvc/getAllThemesInfo?moreInfo=Y"
# Update query params and header
headers <- httr::add_headers(
Authorization = token
)
# Make the GET request
response <- httr::GET(url, headers)
# Extract the information
content <- content(response, "text")
theme_df <- as.data.frame(fromJSON(content))
# Save theme as rds
write_rds(theme_df, "data/processed/theme.rds")
The Master Plan 2019 Subzone Boundary data is also loaded with st_read() from the sf package and projected accordingly with st_transform(). This will be important for visualising the different geospatial datasets in the subsequent steps.
mpsz19_shp <- st_read(dsn = "data/geospatial/MPSZ-2019/", layer = "MPSZ-2019") %>%
st_transform(crs = 3414)
Reading layer `MPSZ-2019' from data source
`C:\Users\Henry\Desktop\SMU Masters\2024-2025 T1\Geospatial Analytics & Applications\Project\GeospatialWebsite\Take-Home_Ex\Take-Home_Ex_03\data\geospatial\MPSZ-2019'
using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS: WGS 84
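The tmap warnings that appear later ("The shape mpsz19_shp is invalid") stem from invalid polygons in this layer. One way to repair them upfront is st_make_valid(); sketched here on a synthetic self-intersecting polygon rather than on mpsz19_shp itself:

```r
library(sf)

# A self-intersecting "bowtie" polygon, the kind of geometry tmap complains about
bowtie <- st_sfc(st_polygon(list(rbind(c(0, 0), c(2, 2), c(2, 0), c(0, 2), c(0, 0)))),
                 crs = 3414)
st_is_valid(bowtie)   # FALSE

fixed <- st_make_valid(bowtie)
st_is_valid(fixed)    # TRUE
```

Applying st_make_valid() once to mpsz19_shp right after st_read() would silence the repeated warnings in the visualisation steps.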
2.2.1 CBD
The central business district (CBD) is an important region in Singapore since it is where many businesses reside. This matters particularly to working adults, as staying in a HDB located closer to the CBD could mean shorter travelling time between workplace and home. While in recent years there are various other business districts (East - Changi Business Park, Tampines, Paya Lebar; West - One North Business Park), this exercise will only consider the Central Business District.
According to Google, the CBD is located at latitude 1.287953, longitude 103.851784. I will save this as a sf dataframe with st_as_sf() from the sf package, and then save it in rds format.
# Cbd lat lon data
cbd_lat <- 1.287953
cbd_lon <- 103.851784
# Create cbd_sf dataframe
cbd_sf <- data.frame(cbd_lat, cbd_lon) %>%
st_as_sf(coords = c("cbd_lon", "cbd_lat"), crs = 4326) %>%
st_transform(crs=3414)
# Save as rds
write_rds(cbd_sf, "data/processed/cbd.rds")
2.2.2 Eldercare
From the themes, I can see that the query parameter for Eldercare is “eldercare”. After retrieving the data with a GET request, I will save it as a rds file for easy retrieval.
base_url <- "https://www.onemap.gov.sg/api/public/themesvc/retrieveTheme?queryName="
parameter <- "eldercare"
# Update query params and header
headers <- httr::add_headers(
Authorization = token
)
url <- paste0(base_url,parameter)
# Make the GET request
response <- httr::GET(url, headers)
# Extract the information
content <- content(response, "text")
eldercare_df <- as.data.frame(fromJSON(content))
# Save eldercare_df
write_rds(eldercare_df, "data/processed/eldercare_df.rds")
Now, I can load the data and check it.
# Load eldercare_df
eldercare_df <- read_rds("data/processed/eldercare_df.rds")
# Check output
head(eldercare_df)
  SrchResults.FeatCount SrchResults.Theme_Name SrchResults.Category
1 133 Eldercare Services Family
2 NA <NA> <NA>
3 NA <NA> <NA>
4 NA <NA> <NA>
5 NA <NA> <NA>
6 NA <NA> <NA>
SrchResults.Owner SrchResults.DateTime SrchResults.Published_Date
1 MINISTRY OF HEALTH 2016-07-28T19:23:22+00:00 2012-01-01T00:00:00+00:00
2 <NA> <NA> <NA>
3 <NA> <NA> <NA>
4 <NA> <NA> <NA>
5 <NA> <NA> <NA>
6 <NA> <NA> <NA>
SrchResults.Formatted_DateTime SrchResults.Formatted_Published_Date
1 28/07/2016 01/01/2012
2 <NA> <NA>
3 <NA> <NA>
4 <NA> <NA>
5 <NA> <NA>
6 <NA> <NA>
SrchResults.NAME SrchResults.ADDRESSPOSTALCODE
1 <NA> <NA>
2 Yuhua Senior Activity Centre 601318
3 THK SAC @ Kaki Bukit 462509
4 THK SAC @ Boon Lay 640190
5 PEACE-Connect Senior Activity Centre@5 190005
6 THK SAC @ Beo Crescent 160044
SrchResults.ADDRESSSTREETNAME SrchResults.Type
1 <NA> <NA>
2 318A Jurong East Avenue 1 #02-308 Point
3 Blk 509B Bedok North Street 3 #02-157 Point
4 Blk 190 Boon Lay Drive #01-242 Point
5 5 Beach Rd #02-4943 Point
6 Blk 44 Beo Crescent #01-67 Point
SrchResults.LatLng SrchResults.ICON_NAME
1 <NA> <NA>
2 1.34762352332232,103.731009086814 onemap-fc-eldercare.png
3 1.33369344925001,103.93039693278 onemap-fc-eldercare.png
4 1.3450767624509,103.711850429975 onemap-fc-eldercare.png
5 1.30434623485538,103.864815785476 onemap-fc-eldercare.png
6 1.28880916599737,103.826295090766 onemap-fc-eldercare.png
After looking at the eldercare dataframe, it is clear that I will need to do some additional processing:
- Separate the single LatLng column into lat and lon columns
- Select only the columns I need, removing the metadata row with a filter condition
- Convert it into a sf dataframe with st_as_sf() from the sf package, then apply the appropriate projection with st_transform()
Once done, I will save this as an rds file.
# Rename colnames
colnames(eldercare_df) <- gsub("^SrchResults\\.", "", colnames(eldercare_df))
# Process sf dataframe
eldercare_sf <- eldercare_df %>%
separate(LatLng, into = c("lat", "lon"), sep = ",") %>%
select(NAME, ADDRESSPOSTALCODE, ADDRESSSTREETNAME, Type, lat, lon) %>%
filter(!is.na(NAME)) %>%
st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
st_transform(crs=3414)
# Save sf dataframe
write_rds(eldercare_sf, "data/processed/eldercare_sf.rds")
Since there is also an eldercare dataset on data.gov.sg, I will load it in along with the previously saved eldercare sf dataframe.
# Load Eldercare Data.gov.sg dataset
eldercare_dg_sf <- st_read("data/geospatial/EldercareServicesSHP", layer = "ELDERCARE") %>%
st_transform(crs = 3414)
Reading layer `ELDERCARE' from data source
`C:\Users\Henry\Desktop\SMU Masters\2024-2025 T1\Geospatial Analytics & Applications\Project\GeospatialWebsite\Take-Home_Ex\Take-Home_Ex_03\data\geospatial\EldercareServicesSHP'
using driver `ESRI Shapefile'
Simple feature collection with 133 features and 18 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 14481.92 ymin: 28218.43 xmax: 41665.14 ymax: 46804.9
Projected CRS: SVY21
# Load Saved OneMap API data
eldercare_sf <- read_rds("data/processed/eldercare_sf.rds")
By visualising both datasets side by side with tmap_arrange() from the tmap package, I can better see if there are any discrepancies, to decide which dataset to use and whether any additional processing is required.
# Set tmap mode to plot
tmap_mode("plot")
tmap mode set to plotting
# Calculate row counts
num_eldercare_sf <- nrow(eldercare_sf)
num_eldercare_dg_sf <- nrow(eldercare_dg_sf)
tmap_options(check.and.fix = TRUE)
# Visualize eldercare from OneMap API
map_onemap <- tm_shape(mpsz19_shp) +
tm_polygons() +
tm_shape(eldercare_sf) +
tm_dots(col='red', size = 0.2) +
tm_layout(main.title = paste("Eldercare (OneMap API) - Count:", num_eldercare_sf),
main.title.position = ("center"))
# Visualize eldercare from Data.gov.sg
map_data_gov <- tm_shape(mpsz19_shp) +
tm_polygons() +
tm_shape(eldercare_dg_sf) +
tm_dots(col='red', size = 0.2) +
tm_layout(main.title = paste("Eldercare (Data Gov) - Count:", num_eldercare_dg_sf),
main.title.position = ("center"))
# Visualize both side by side
tmap_arrange(map_onemap, map_data_gov, ncol = 2, nrow = 1)
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid

Since the counts of both are the same and there do not seem to be any noticeable discrepancies between them, I can proceed to use either of them in the subsequent steps.
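Eyeballing the two maps works here, but the agreement can also be checked programmatically: if every OneMap point has a data.gov.sg point within a few metres, the two sources describe the same facilities. A sketch on synthetic stand-ins for the two eldercare layers (the real check would substitute eldercare_sf and eldercare_dg_sf):

```r
library(sf)

# Toy stand-ins for the two eldercare layers, in projected metres
a <- st_as_sf(data.frame(x = c(0, 500), y = c(0, 0)), coords = c("x", "y"), crs = 3414)
b <- st_as_sf(data.frame(x = c(0, 501), y = c(0, 0)), coords = c("x", "y"), crs = 3414)

# Distance from each point in a to its nearest counterpart in b
idx <- st_nearest_feature(a, b)
gap <- as.numeric(st_distance(a, b[idx, ], by_element = TRUE))
max(gap)  # a small maximum gap means the two sources effectively agree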
2.2.3 Foodcourt/Hawker Centres
The hawker centre dataset from data.gov.sg will be loaded in with st_read() and projected with st_transform() from the sf package.
# Load Hawker Data.gov.sg dataset
hawker_sf <- st_read("data/geospatial/HawkerCentresKML.kml") %>%
st_transform(crs = 3414)
Reading layer `HAWKERCENTRE' from data source
`C:\Users\Henry\Desktop\SMU Masters\2024-2025 T1\Geospatial Analytics & Applications\Project\GeospatialWebsite\Take-Home_Ex\Take-Home_Ex_03\data\geospatial\HawkerCentresKML.kml'
using driver `KML'
Simple feature collection with 125 features and 2 fields
Geometry type: POINT
Dimension: XYZ
Bounding box: xmin: 103.6974 ymin: 1.272716 xmax: 103.9882 ymax: 1.449017
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
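Note the XYZ dimension above: KML geometries carry a (here all-zero) Z coordinate. If that ever causes friction when combining with the XY layers, st_zm() drops it; a minimal sketch on a synthetic point:

```r
library(sf)

# A KML-style XYZ point with a zero Z coordinate
pt_xyz <- st_sfc(st_point(c(103.8, 1.3, 0)), crs = 4326)
ncol(st_coordinates(pt_xyz))   # 3 columns: X, Y, Z

pt_xy <- st_zm(pt_xyz)         # drop the Z (and M) dimensions
ncol(st_coordinates(pt_xy))    # 2 columns: X, Y
```
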
As with other datasets, I will visualize it with tmap to ensure that it is fine to use.
# Set tmap mode to plot
tmap_mode("plot")
tmap mode set to plotting
tmap_options(check.and.fix = TRUE)
# Visualize hawker centres
tm_shape(mpsz19_shp) +
tm_polygons() +
tm_shape(hawker_sf) +
tm_dots(col='red', size = 0.2) +
tm_layout(main.title = "Hawker Centre (Data Gov)",
main.title.position = ("center"))
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid

Looking at the map, there does not seem to be any anomaly with the points, so I can proceed to use this dataset in the subsequent steps.
2.2.4 MRT
The MRT dataset will be loaded in from the TrainStationExit dataset in LTA Datamall with st_read() and projected with st_transform() from the sf package.
# Load train station exit data
train_sf <- st_read(dsn = "data/geospatial/TrainStationExit", layer = "Train_Station_Exit_Layer") %>%
st_transform(crs = 3414)
Reading layer `Train_Station_Exit_Layer' from data source
`C:\Users\Henry\Desktop\SMU Masters\2024-2025 T1\Geospatial Analytics & Applications\Project\GeospatialWebsite\Take-Home_Ex\Take-Home_Ex_03\data\geospatial\TrainStationExit'
using driver `ESRI Shapefile'
Simple feature collection with 593 features and 2 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 6134.086 ymin: 27499.7 xmax: 45356.36 ymax: 47865.92
Projected CRS: SVY21
As with other datasets, I will visualize it with tmap to ensure that it is fine to use.
# Set tmap mode to plot
tmap_mode("plot")
tmap mode set to plotting
tmap_options(check.and.fix = TRUE)
# Visualize train station exits
tm_shape(mpsz19_shp) +
tm_polygons() +
tm_shape(train_sf) +
tm_dots(col='red', size = 0.2) +
tm_layout(main.title = "Train Station Exits (LTA DataMall)",
main.title.position = ("center"))
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid

Looking at the map, there does not seem to be any anomaly with the points, so I can proceed to use this dataset in the subsequent steps.
2.2.5 Park
From the themes, I noticed that there are a few candidates for the query parameter for parks: “nationalparks”, “nparks_parks”, “nr_gaz_2005”. After some trial and error, I decided to go with “nationalparks” and retrieved the data with a GET request. I will then save it as an rds file for easy retrieval.
base_url <- "https://www.onemap.gov.sg/api/public/themesvc/retrieveTheme?queryName="
parameter <- "nationalparks"
# Update query params and header
headers <- httr::add_headers(
Authorization = token
)
url <- paste0(base_url,parameter)
# Make the GET request
response <- httr::GET(url, headers)
# Extract the information
content <- content(response, "text")
parks_df <- as.data.frame(fromJSON(content))
# Save parks_df
write_rds(parks_df, "data/processed/parks_df.rds")
Now, I can load the data and check it.
# Load parks_df
parks_df <- read_rds("data/processed/parks_df.rds")
# Check output
head(parks_df)
  SrchResults.FeatCount SrchResults.Theme_Name SrchResults.Category
1 441 Parks Recreation
2 NA <NA> <NA>
3 NA <NA> <NA>
4 NA <NA> <NA>
5 NA <NA> <NA>
6 NA <NA> <NA>
SrchResults.Owner SrchResults.DateTime SrchResults.Published_Date
1 NATIONAL PARKS BOARD 2024-08-07T17:37:30+00:00 2012-01-01T00:00:00+00:00
2 <NA> <NA> <NA>
3 <NA> <NA> <NA>
4 <NA> <NA> <NA>
5 <NA> <NA> <NA>
6 <NA> <NA> <NA>
SrchResults.Formatted_DateTime SrchResults.Formatted_Published_Date
1 07/08/2024 01/01/2012
2 <NA> <NA>
3 <NA> <NA>
4 <NA> <NA>
5 <NA> <NA>
6 <NA> <NA>
SrchResults.NAME SrchResults.Type
1 <NA> <NA>
2 POH HUAT RD PG Point
3 LABRADOR NATURE RESERVE Point
4 BUKIT TIMAH NATURE RESERVE Point
5 UPPER SELETAR RESERVOIR PARK Point
6 UPPER PEIRCE RESERVOIR PARK Point
SrchResults.LatLng SrchResults.ICON_NAME
1 <NA> <NA>
2 1.36693711592976,103.881611123538 parks.gif
3 1.26735620427772,103.802037219365 parks.gif
4 1.35405947748311,103.77909858021 parks.gif
5 1.40126178058655,103.807440017621 parks.gif
6 1.3727909318913,103.811866098477 parks.gif
After looking at the parks dataframe, it is clear that I will need to do some additional processing, similar to eldercare:
- Separate the single LatLng column into lat and lon columns
- Select only the columns I need, removing the metadata row with a filter condition
- Convert it into a sf dataframe with st_as_sf() from the sf package, then apply the appropriate projection with st_transform()
Once done, I will save this as an rds file.
# Rename colnames
colnames(parks_df) <- gsub("^SrchResults\\.", "", colnames(parks_df))
# Process sf dataframe
parks_sf <- parks_df %>%
separate(LatLng, into = c("lat", "lon"), sep = ",") %>%
select(NAME, Type, lat, lon) %>%
filter(!is.na(NAME)) %>%
st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
st_transform(crs=3414)
# Save sf dataframe
write_rds(parks_sf, "data/processed/parks_sf.rds")
Since there is also a parks dataset on data.gov.sg, I will load it in along with the previously saved parks sf dataframe.
# Load data.gov.sg parks data
parks_dg_sf <- st_read("data/geospatial/Parks.kml") %>%
st_transform(crs = 3414)
Reading layer `NATIONALPARKS' from data source
`C:\Users\Henry\Desktop\SMU Masters\2024-2025 T1\Geospatial Analytics & Applications\Project\GeospatialWebsite\Take-Home_Ex\Take-Home_Ex_03\data\geospatial\Parks.kml'
using driver `KML'
Simple feature collection with 430 features and 2 fields
Geometry type: POINT
Dimension: XYZ
Bounding box: xmin: 103.6929 ymin: 1.214491 xmax: 104.0538 ymax: 1.462094
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
# Load Saved OneMap API data
parks_sf <- read_rds("data/processed/parks_sf.rds")
Similar to eldercare, I will visualise both datasets side by side with tmap_arrange() from the tmap package, to better see if there are any discrepancies, to decide which dataset to use and whether any additional processing is required.
# Set tmap mode to plot
tmap_mode("plot")
tmap mode set to plotting
# Calculate row counts
num_parks_sf <- nrow(parks_sf)
num_parks_dg_sf <- nrow(parks_dg_sf)
tmap_options(check.and.fix = TRUE)
# Visualize parks from OneMap API
map_onemap <- tm_shape(mpsz19_shp) +
tm_polygons() +
tm_shape(parks_sf) +
tm_dots(col='red', size = 0.1) +
tm_layout(main.title = paste("Parks (OneMap API) - Count:", num_parks_sf),
main.title.position = ("center"))
# Visualize parks from Data.gov.sg
map_data_gov <- tm_shape(mpsz19_shp) +
tm_polygons() +
tm_shape(parks_dg_sf) +
tm_dots(col='red', size = 0.1) +
tm_layout(main.title = paste("Parks (Data Gov) - Count:", num_parks_dg_sf),
main.title.position = ("center"))
# Visualize both side by side
tmap_arrange(map_onemap, map_data_gov, ncol = 2, nrow = 1)
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid

There are more parks from the OneMap API than from data.gov.sg. Also, data.gov.sg has one park suspiciously south of Singapore, in the middle of two islands. Therefore, moving forward I will use the OneMap API dataset for parks instead. Nevertheless, parks in the "North-Eastern Islands" subzone will be removed, as I don't think they are relevant when considering proximity to HDBs. Once done, I will save this as an updated rds file.
# Remove North-Eastern Islands parks
mpsz19_shp_filtered <- mpsz19_shp %>%
filter(SUBZONE_N != "NORTH-EASTERN ISLANDS")
# Filter parks that are within filtered mpsz19
parks_updated_sf <- parks_sf[
apply(st_within(parks_sf, mpsz19_shp_filtered, sparse = FALSE), 1, any),
]
Now I will visualize it again to confirm.
# Set tmap mode to plot
tmap_mode("plot")
tmap mode set to plotting
tmap_options(check.and.fix = TRUE)
# Visualize updated parks
tm_shape(mpsz19_shp) +
tm_polygons() +
tm_shape(parks_updated_sf) +
tm_dots(col='red', size = 0.2) +
tm_layout(main.title = "Parks (OneMap API) Updated",
main.title.position = ("center"))
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid

Now, I can save it to be used in the subsequent steps.
# Save updated park sf dataframe as rds
write_rds(parks_updated_sf, "data/processed/parks_updated_sf.rds")
2.2.6 Primary School
The list of primary schools is obtained from the MOE website.
primary_school_list <- c("Admiralty Primary School","Ahmad Ibrahim Primary School","Ai Tong School","Alexandra Primary School",
"Anchor Green Primary School","Anderson Primary School","Ang Mo Kio Primary School","Anglo-Chinese School (Junior)",
"Anglo-Chinese School (Primary)","Angsana Primary School","Beacon Primary School","Bedok Green Primary School",
"Bendemeer Primary School","Blangah Rise Primary School","Boon Lay Garden Primary School","Bukit Panjang Primary School",
"Bukit Timah Primary School","Bukit View Primary School","Canberra Primary School","Canossa Catholic Primary School",
"Cantonment Primary School","Casuarina Primary School","Catholic High School (Primary)","Cedar Primary School",
"Changkat Primary School","CHIJ (Katong) Primary","CHIJ (Kellock)","CHIJ Our Lady of Good Counsel",
"CHIJ Our Lady of the Nativity","CHIJ Our Lady Queen of Peace","CHIJ Primary (Toa Payoh)",
"CHIJ St. Nicholas Girls' School (Primary Section)","Chongfu School","Chongzheng Primary School",
"Chua Chu Kang Primary School","Clementi Primary School","Compassvale Primary School","Concord Primary School",
"Corporation Primary School","Damai Primary School","Dazhong Primary School","De La Salle School",
"East Spring Primary School","Edgefield Primary School","Elias Park Primary School","Endeavour Primary School",
"Evergreen Primary School","Fairfield Methodist School (Primary)","Farrer Park Primary School",
"Fengshan Primary School","Fern Green Primary School","Fernvale Primary School","First Toa Payoh Primary School",
"Frontier Primary School","Fuchun Primary School","Fuhua Primary School","Gan Eng Seng Primary School",
"Geylang Methodist School (Primary)","Gongshang Primary School","Greendale Primary School","Greenridge Primary School",
"Greenwood Primary School","Haig Girls' School","Henry Park Primary School","Holy Innocents' Primary School",
"Hong Wen School","Horizon Primary School","Hougang Primary School","Huamin Primary School","Innova Primary School",
"Jiemin Primary School","Jing Shan Primary School","Junyuan Primary School","Jurong Primary School",
"Jurong West Primary School","Keming Primary School","Kheng Cheng School","Kong Hwa School","Kranji Primary School",
"Kuo Chuan Presbyterian Primary School","Lakeside Primary School","Lianhua Primary School","Maha Bodhi School",
"Maris Stella High School (Primary Section)","Marsiling Primary School","Marymount Convent School",
"Mayflower Primary School","Mee Toh School","Meridian Primary School","Methodist Girls' School (Primary)",
"Montfort Junior School","Nan Chiau Primary School","Nan Hua Primary School","Nanyang Primary School",
"Naval Base Primary School","New Town Primary School","Ngee Ann Primary School","North Spring Primary School",
"North View Primary School","North Vista Primary School","Northland Primary School","Northoaks Primary School",
"Northshore Primary School","Oasis Primary School","Opera Estate Primary School","Palm View Primary School",
"Park View Primary School","Pasir Ris Primary School","Paya Lebar Methodist Girls' School (Primary)",
"Pei Chun Public School","Pei Hwa Presbyterian Primary School","Pei Tong Primary School","Peiying Primary School",
"Pioneer Primary School","Poi Ching School","Princess Elizabeth Primary School","Punggol Cove Primary School",
"Punggol Green Primary School","Punggol Primary School","Punggol View Primary School","Qifa Primary School",
"Qihua Primary School","Queenstown Primary School","Radin Mas Primary School","Raffles Girls' Primary School",
"Red Swastika School","River Valley Primary School","Riverside Primary School","Rivervale Primary School",
"Rosyth School","Rulang Primary School","Sembawang Primary School","Seng Kang Primary School",
"Sengkang Green Primary School","Shuqun Primary School","Si Ling Primary School",
"Singapore Chinese Girls’ School (Primary)","South View Primary School","Springdale Primary School",
"St. Andrew's Junior School","St. Anthony's Canossian Primary School","St. Anthony's Primary School",
"St. Gabriel's Primary School","St. Hilda's Primary School","St. Joseph's Institution Junior",
"St. Margaret's Primary School","St. Stephen's School","Tampines North Primary School","Tampines Primary School",
"Tanjong Katong Primary School","Tao Nan School","Teck Ghee Primary School","Teck Whye Primary School",
"Telok Kurau Primary School","Temasek Primary School","Townsville Primary School","Unity Primary School",
"Valour Primary School","Waterway Primary School","Wellington Primary School","West Grove Primary School",
"West Spring Primary School","West View Primary School","Westwood Primary School","White Sands Primary School",
"Woodgrove Primary School","Woodlands Primary School","Woodlands Ring Primary School","Xinghua Primary School",
"Xingnan Primary School","Xinmin Primary School","Xishan Primary School","Yangzheng Primary School",
"Yew Tee Primary School","Yio Chu Kang Primary School","Yishun Primary School","Yu Neng Primary School",
"Yuhua Primary School","Yumin Primary School","Zhangde Primary School","Zhenghua Primary School","Zhonghua Primary School"
)
For each of the primary schools, I will reuse the get_coords() function to retrieve the coordinates. These will then be saved as rds for easy retrieval.
# Get coordinates of primary schools
primary_school_coords <- get_coords(primary_school_list)
# Save coordinates as rds
write_rds(primary_school_coords, "data/processed/primary_school_coords.rds")
Now I can load in the primary school coordinates and check if there is any missing data.
# Load primary school coordinates
primary_school_coords <- read_rds("data/processed/primary_school_coords.rds")
# Check coordinates data
primary_school_coords[(is.na(primary_school_coords$postal) | is.na(primary_school_coords$latitude) | is.na(primary_school_coords$longitude) | primary_school_coords$postal=="NIL"), ]
[1] address postal latitude longitude
<0 rows> (or 0-length row.names)
Now that the coordinates are fine, I can proceed to convert them into a sf dataframe with st_as_sf() and do the appropriate projection with st_transform(). It will then be saved as an rds file.
# Convert primary_school_coords to sf
primary_sch_sf <- primary_school_coords %>%
st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
st_transform(crs=3414)
# Save primary_sch_sf as rds
write_rds(primary_sch_sf, "data/processed/primary_sch_sf.rds")
Now I can load the rds file back.
# Load prmary school rds
primary_sch_sf <- read_rds("data/processed/primary_sch_sf.rds")
As with other datasets, I will visualize it with tmap to ensure that it is fine to use.
# Set tmap mode to plot
tmap_mode("plot")
tmap mode set to plotting
tmap_options(check.and.fix = TRUE)
# Visualize primary schools
tm_shape(mpsz19_shp) +
tm_polygons() +
tm_shape(primary_sch_sf) +
tm_dots(col='red', size = 0.2) +
tm_layout(main.title = "Primary Schools (OneMap API)",
main.title.position = ("center"))
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid

Looking at the map, there does not seem to be any anomaly with the points, so I can proceed to use this dataset in the subsequent steps. However, one further step is required given that we want to identify "good" primary schools. Although my seniors used schlah, I decided to go with outofbox after cross-referencing its list with others (e.g. propertyguru, asipretutor, etc.). While outofbox does not have a sophisticated model like schlah, I find that demand through oversubscription is the purest measure of demand. For this list, I will use the top 20 schools, which is about ~10% of the primary schools in Singapore.
After filtering, the list will be saved to be used in the later sections.
# List of good primary schools
good_primary_school_list <- c("Methodist Girls' School (Primary)", "Tao Nan School", "Ai Tong School",
"Holy Innocents' Primary School", "CHIJ St. Nicholas Girls' School (Primary Section)",
"Admiralty Primary School", "St. Joseph's Institution Junior", "Catholic High School (Primary)",
"Anglo-Chinese School (Junior)", "Chongfu School", "Kong Hwa School", "St. Hilda's Primary School",
"Anglo-Chinese School (Primary)", "Nan Chiau Primary School", "Nan Hua Primary School",
"Nanyang Primary School", "Pei Hwa Presbyterian Primary School", "Kuo Chuan Presbyterian Primary School",
"Rulang Primary School", "Singapore Chinese Girls’ School (Primary)")
# Filter primary school
gd_primary_sch_sf <- primary_sch_sf %>%
filter(address %in% good_primary_school_list)
# Save to rds
write_rds(gd_primary_sch_sf, "data/processed/gd_primary_sch_sf.rds")
2.2.7 Shopping Mall
The shopping mall dataset is the trickiest of all. While there are many lists online, I chanced upon the OpenStreetMap API via overpass turbo. Since shopping malls on OpenStreetMap carry the attribute “shop” = “mall”, a query can be run to retrieve all shopping malls in Singapore.
Besides this, I have explored the following:
- The osmdata package. While it can also extract OpenStreetMap data, the results are counter-intuitive (e.g. they contain data from Johor Bahru) and points and polygons come as separate outputs.
- OpenStreetMap building data for Singapore from HDX. After working with it, I found that filtering buildings by “shop” = “mall” returns very few results, possibly because shopping malls are not tagged as buildings.
- Go to the overpass turbo website
- Run the following query:
- [out:json][timeout:25]; area["name"="Singapore"]->.searchArea; ( node["shop"="mall"](area.searchArea); way["shop"="mall"](area.searchArea); relation["shop"="mall"](area.searchArea); ); out body; >; out skel qt;
- After generating the results, they can be exported as KML and saved into the data folder.
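Instead of exporting manually from overpass turbo, the same query can be sent straight to the public Overpass API with httr and jsonlite (both already loaded). This is a sketch, assuming the https://overpass-api.de/api/interpreter endpoint; `out center` gives ways and relations a representative point so everything can be treated uniformly as point data:

```r
# Run the shopping-mall Overpass query programmatically
query <- '
[out:json][timeout:25];
area["name"="Singapore"]->.searchArea;
(
  node["shop"="mall"](area.searchArea);
  way["shop"="mall"](area.searchArea);
  relation["shop"="mall"](area.searchArea);
);
out center;'

resp <- httr::POST("https://overpass-api.de/api/interpreter",
                   body = list(data = query), encode = "form")

# elements holds one row per mall, with lat/lon (nodes) or center (ways/relations)
elems <- jsonlite::fromJSON(httr::content(resp, "text"))$elements
```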

I will load in the shopping mall data from OpenStreetMap with st_read() and do some basic processing:
- Project with st_transform() from the sf package
- Filter out data that have no name
- Use st_point_on_surface() to convert all “POLYGON” geometries (if any) into “POINT”
# Load Osm shopping mall data
shoppingmall_sf <- st_read("data/geospatial/Shopping Mall.kml") %>%
st_transform(crs = 3414) %>%
filter(Name != "" & !is.na(Name)) %>%
st_point_on_surface()
Reading layer `overpass-turbo.eu export' from data source
`C:\Users\Henry\Desktop\SMU Masters\2024-2025 T1\Geospatial Analytics & Applications\Project\GeospatialWebsite\Take-Home_Ex\Take-Home_Ex_03\data\geospatial\Shopping Mall.kml'
using driver `KML'
Simple feature collection with 255 features and 2 fields
Geometry type: GEOMETRY
Dimension: XY
Bounding box: xmin: 103.6777 ymin: 1.2466 xmax: 103.9945 ymax: 1.448547
Geodetic CRS: WGS 84
Warning: st_point_on_surface assumes attributes are constant over geometries
I will use the list of shopping malls from Wikipedia to cross-reference the two lists of malls.
# Mall List from Wiki
mall_wiki_list <- c("100 AM","313@Somerset","Aperia","Balestier Hill Shopping Centre","Bugis Cube","Bugis Junction",
"Bugis+","Capitol Piazza","Cathay Cineleisure Orchard","Clarke Quay Central","The Centrepoint",
"City Square Mall","City Gate Mall","CityLink Mall","Duo","Far East Plaza","Funan","Great World City",
"HDB Hub","Holland Village Shopping Mall","ION Orchard","Junction 8","Knightsbridge","Liat Towers",
"Lucky Plaza","Marina Bay Sands","The Shoppes at Marina Bay Sands","Marina Bay Link Mall",
"Marina Square","Millenia Walk","Mustafa Shopping Centre","Ngee Ann City","One Holland Village",
"Orchard Central","Orchard Gateway","Orchard Plaza","Midpoint Orchard","Palais Renaissance",
"People's Park Centre","People's Park Complex","Plaza Singapura","GRiD(pomo)","Raffles City",
"Scotts Square","Shaw House and Centre","Sim Lim Square","Singapore Shopping Centre","The South Beach",
"Square 2","Sunshine Plaza","Suntec City","Tanglin Mall","Tanjong Pagar Centre","Tekka Centre",
"The Adelphi","The Paragon","Tiong Bahru Plaza","The Poiz","Thomson Plaza","United Square","Thomson V",
"Velocity@Novena Square","Wheelock Place","Wisma Atria","Zhongshan Mall","Bedok Mall","Century Square",
"City Plaza","Changi City Point","Downtown East","Djitsun Mall Bedok","Eastpoint Mall","Jewel Changi Airport",
"KINEX (formerly OneKM)","Katong Shopping Centre","Katong Square","Kallang Wave Mall","Leisure Park Kallang",
"i12 Katong","Our Tampines Hub","Parkway Parade","Pasir Ris Mall","Pasir Ris West Plaza","Paya Lebar Square",
"Paya Lebar Quarter (PLQ)","Roxy Square","Singpost Centre","Tampines 1","Tampines Mall","White Sands",
"Elias Mall","Loyang Point","888 Plaza","Admiralty Place","AMK Hub","Canberra Plaza","Causeway Point",
"Broadway Plaza","Jubilee Square","Junction Nine","Marsiling Mall","Northpoint City","Sembawang Shopping Centre",
"Sun Plaza","Vista Point","Wisteria Mall","Woodlands Civic Centre","Woodlands Mart","Woodlands North Plaza",
"Anchorvale Village","Buangkok Square","Compass One","Greenwich V","Heartland Mall","Hougang 1",
"Hougang Green Shopping Mall","Hougang Mall","NEX","Northshore Plaza","Oasis Terraces","Punggol Coast Mall",
"Punggol Plaza","Rivervale Mall","Rivervale Plaza","Sengkang Grand Mall","The Seletar Mall",
"Upper Serangoon Shopping Centre","Waterway Point","myVillage At Serangoon Garden","Beauty World Centre",
"Beauty World Plaza","Bukit Panjang Plaza","Bukit Timah Plaza","Fajar Shopping Centre",
"Greenridge Shopping Centre","Hillion Mall","HillV2","Junction 10","Keat Hong Shopping Centre",
"Limbang Shopping Centre","Lot One","Rail Mall","Sunshine Place","Teck Whye Shopping Centre","West Mall",
"Yew Tee Point","Yew Tee Square","VivoCity","HarbourFront Centre","Alexandra Retail Centre","321 Clementi",
"The Clementi Mall","IMM","Jem","Westgate","Jurong Point","Pioneer Mall","The Star Vista","Alexandra Central",
"Anchorpoint","OD Mall","Boon Lay Shopping Centre","Grantral Mall","Fairprice Hub","Gek Poh Shopping Centre",
"Rochester Mall","Taman Jurong Shopping Centre","West Coast Plaza","Plantation Plaza","Queensway Shopping Centre","The Rail Mall"
)
It is always good to visualize the data to check for anomalies.
# Set tmap mode to plot
tmap_mode("plot")
tmap mode set to plotting
tmap_options(check.and.fix = TRUE)
# Visualize shopping malls
tm_shape(shoppingmall_sf) +
tm_dots(col='red', size = 0.1) +
tm_layout(main.title = "Shopping Malls (OpenStreetMap)",
main.title.position = ("center"))
Using setdiff(), I will compare both lists to see which malls are listed in one but not the other.
# Only in wiki
only_in_wiki <- setdiff(mall_wiki_list, shoppingmall_sf$Name)
sort(only_in_wiki)
 [1] "100 AM" "321 Clementi"
[3] "AMK Hub" "Anchorvale Village"
[5] "Aperia" "Balestier Hill Shopping Centre"
[7] "Buangkok Square" "Bugis Cube"
[9] "Capitol Piazza" "City Gate Mall"
[11] "Duo" "Eastpoint Mall"
[13] "Grantral Mall" "GRiD(pomo)"
[15] "HarbourFront Centre" "HillV2"
[17] "Holland Village Shopping Mall" "Jem"
[19] "KINEX (formerly OneKM)" "Lot One"
[21] "Marina Bay Sands" "Marsiling Mall"
[23] "Mustafa Shopping Centre" "myVillage At Serangoon Garden"
[25] "NEX" "Northshore Plaza"
[27] "Oasis Terraces" "OD Mall"
[29] "Our Tampines Hub" "Pasir Ris West Plaza"
[31] "Paya Lebar Quarter (PLQ)" "People's Park Centre"
[33] "People's Park Complex" "Punggol Coast Mall"
[35] "Raffles City" "Rail Mall"
[37] "Shaw House and Centre" "Singapore Shopping Centre"
[39] "Singpost Centre" "Tanjong Pagar Centre"
[41] "Teck Whye Shopping Centre" "The Adelphi"
[43] "The Paragon" "The Poiz"
[45] "The Rail Mall" "The Seletar Mall"
[47] "The South Beach" "Thomson V"
[49] "Velocity@Novena Square" "Woodlands Civic Centre"
[51] "Woodlands Mart" "Woodlands North Plaza"
# Only in osm
only_in_osm <- setdiff(shoppingmall_sf$Name, mall_wiki_list)
sort(only_in_osm)
 [1] "100AM"
[2] "18 Tai Seng"
[3] "Ang Mo Kio Hub"
[4] "Aperia Mall"
[5] "Balestier Plaza"
[6] "Balmoral Plaza"
[7] "Bencoolen"
[8] "Bencoolen Underground Mall"
[9] "Big Block"
[10] "Bras Basah Complex"
[11] "Buangkok Square Mall"
[12] "Bukit Timah Shopping Center"
[13] "Changi Airport Terminal 1"
[14] "Changi Airport Terminal 2"
[15] "Changi Airport Terminal 3"
[16] "Clementi 321"
[17] "Cluny Court"
[18] "Concorde Shopping Centre"
[19] "Coronation Plaza"
[20] "Dairy Farm Mall"
[21] "Dawson Place"
[22] "Delfi Orchard"
[23] "Depot Heights Shopping Centre"
[24] "Design Orchard"
[25] "Downtown Gallery"
[26] "Duo Residences"
[27] "East Point"
[28] "Eastlink Mall"
[29] "Esplanade Xchange"
[30] "Excelsior Shopping Centre"
[31] "Far East Shopping Centre"
[32] "Fortune Centre"
[33] "Forum The Shopping Mall"
[34] "Fusionopolis 1"
[35] "GR.ID"
[36] "Grandlink Square"
[37] "Grantral Mall @ Clementi"
[38] "Guoco Tower"
[39] "HillV2 Mall"
[40] "Icon Village"
[41] "Jalan Besar Plaza"
[42] "JEM"
[43] "Jewel Changi Basement"
[44] "Joo Chiat Complex"
[45] "Kang Kar Mall"
[46] "Katong Plaza"
[47] "Katong V"
[48] "Kinex"
[49] "Le Quest Mall"
[50] "Liat Tower"
[51] "Liv@Changi"
[52] "Lot One Shoppers' Mall"
[53] "MacPherson Mall"
[54] "Mandarin Gallery"
[55] "Marina One"
[56] "Marsiling Mall Hawker Centre"
[57] "Mustafa Centre"
[58] "myVillage"
[59] "nex"
[60] "Northshore Plaza I"
[61] "Northshore Plaza II"
[62] "Odeon Katong Shopping Complex"
[63] "Old Airport Road Food Centre & Shopping Mall"
[64] "Orchard Building"
[65] "Orchard Gateway @ Emerald"
[66] "Orchard Shopping Centre"
[67] "Orchard Towers"
[68] "OUE Downtown"
[69] "Pacific Plaza"
[70] "Paragon"
[71] "Parklane Shopping Mall"
[72] "Peninsula Plaza"
[73] "Plaza 8"
[74] "Quayside Isle"
[75] "Raffles City Shopping Centre"
[76] "Royal Square at Novena"
[77] "Seletar Mall"
[78] "Shaw House"
[79] "Shaw Plaza"
[80] "Simei Plaza /Blk 248"
[81] "SingPost Centre"
[82] "South Beach Avenue"
[83] "South Beach Quarter"
[84] "STELLAR@TE2"
[85] "Tampines Mart"
[86] "Tang Plaza"
[87] "Tanglin Shopping Centre"
[88] "Tanjong Katong Complex"
[89] "Tekka Place"
[90] "The Cathay"
[91] "The Centris"
[92] "The Concourse"
[93] "The Flow Mall"
[94] "The Heeren"
[95] "The Oasis"
[96] "The Poiz Centre"
[97] "The Woodgrove"
[98] "The Woodleigh Mall"
[99] "Trio"
[100] "Valley Point Shopping Centre"
[101] "Velocity @ Novena Square"
Comparing both lists, I found a number of malls that are labelled differently in the two sources. This should not be an issue. However, there are some malls in the Wikipedia list that are not in the OpenStreetMap list.
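Many of the mismatches above are just spelling variants (“100 AM” vs “100AM”, “JEM” vs “Jem”, “Aperia” vs “Aperia Mall”). A light normalisation before setdiff() shrinks both lists to the genuine gaps; a base-R sketch (norm_name is my own helper, not from any package):

```r
# Normalise mall names: lower-case and strip punctuation/spaces
norm_name <- function(x) gsub("[^a-z0-9]", "", tolower(x))

# Re-run the comparison on normalised names
only_in_wiki_norm <- setdiff(norm_name(mall_wiki_list),
                             norm_name(shoppingmall_sf$Name))
length(only_in_wiki_norm)  # should be shorter than the raw setdiff
```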
After further investigation, I compiled the list of missing malls. There are also malls listed on Wikipedia that I found not relevant for the current exercise:
- OD Mall: Removed
- Our Tampines Hub: Not a mall
- Punggol Coast Mall: Still under construction
- Rail Mall: Duplicate entry of “The Rail Mall”
- Thomson V: Not a mall
# Compile list of missing malls
missing_malls <- c("Anchorvale Village", "Balestier Hill Shopping Centre", "Bugis Cube", "Capitol Piazza", "City Gate Mall",
"HarbourFront Centre", "Holland Road Shopping Centre", "Oasis Terraces", "Pasir Ris West Plaza",
"Paya Lebar Quarter", "People's Park Centre", "People's Park Complex", "Singapore Shopping Centre",
"Teck Whye Shopping Centre", "The Adelphi", "The Rail Mall", "Woodlands Civic Centre",
"Woodlands Mart", "Woodlands North Plaza")
Using the get_coords function, I can then retrieve the geospatial information for those malls. These will be saved as rds for easy retrieval.
# Get coordinates of missing malls
missing_malls_coords <- get_coords(missing_malls)
# Save coordinates as rds
write_rds(missing_malls_coords, "data/processed/missing_malls_coords.rds")
Now, I will load it back in to check if there are any missing coordinates.
# Load missing malls coordinates
missing_malls_coords <- read_rds("data/processed/missing_malls_coords.rds")
# Check coordinates data
missing_malls_coords[(is.na(missing_malls_coords$postal) | is.na(missing_malls_coords$latitude) | is.na(missing_malls_coords$longitude) | missing_malls_coords$postal=="NIL"), ]
[1] address postal latitude longitude
<0 rows> (or 0-length row.names)
Now that the coordinates are fine, I can convert them into an sf dataframe with st_as_sf() and apply the appropriate projection with st_transform(). Once done, a slight adjustment will be made to the column names so that the data can be easily appended to the shopping mall sf dataframe with rbind().
The full shopping mall sf dataframe will then be saved as an rds file.
# Convert missing_malls_coords to sf
missing_malls_sf <- missing_malls_coords %>%
st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
st_transform(crs=3414)
# Process column names to prepare for rbind
colnames(missing_malls_sf)[colnames(missing_malls_sf) == "address"] <- "Name"
colnames(missing_malls_sf)[colnames(missing_malls_sf) == "postal"] <- "Description"
# Merge both datasets
shopping_mall_full_sf <- rbind(shoppingmall_sf, missing_malls_sf)
# Save shopping_mall_full
write_rds(shopping_mall_full_sf, "data/processed/shopping_mall_sf.rds")
This can then be loaded in for one final check.
# Load shopping mall sf
shoppingmall_sf <- read_rds("data/processed/shopping_mall_sf.rds")
tmap package will be used to visualize the updated shopping mall points.
tmap_mode("plot")
tmap mode set to plotting
tmap_options(check.and.fix = TRUE)
tm_shape(mpsz19_shp) +
tm_polygons() +
tm_shape(shoppingmall_sf) +
tm_dots(col='red', size = 0.1) +
tm_layout(main.title = "Shopping Malls (OpenStreetMap + OpenMap API)",
main.title.position = ("center"))
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid

Looking at the map, there does not seem to be any anomaly with the points, so I can proceed to use this dataset in the subsequent steps.
2.2.8 Supermarket
The supermarket dataset from data.gov.sg will be loaded in with st_read() and projected with st_transform() from the sf package.
# Load Supermarket Data.gov.sg dataset
supermarket_sf <- st_read("data/geospatial/SupermarketsKML.kml") %>%
st_transform(crs = 3414)
Reading layer `SUPERMARKETS' from data source
`C:\Users\Henry\Desktop\SMU Masters\2024-2025 T1\Geospatial Analytics & Applications\Project\GeospatialWebsite\Take-Home_Ex\Take-Home_Ex_03\data\geospatial\SupermarketsKML.kml'
using driver `KML'
Simple feature collection with 526 features and 2 fields
Geometry type: POINT
Dimension: XYZ
Bounding box: xmin: 103.6258 ymin: 1.24715 xmax: 104.0036 ymax: 1.461526
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
As with other datasets, I will visualize it with tmap to ensure that it is fine to use.
# Set tmap mode to plot
tmap_mode("plot")
tmap mode set to plotting
tmap_options(check.and.fix = TRUE)
# Visualize supermarkets
tm_shape(mpsz19_shp) +
tm_polygons() +
tm_shape(supermarket_sf) +
tm_dots(col='red', size = 0.2) +
tm_layout(main.title = "Supermarkets (Data Gov)",
main.title.position = ("center"))
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid

Although the data point at HarbourFront Centre seems suspiciously located, it is deemed acceptable since HarbourFront Centre has two supermarkets - “Cold Storage” and “Don Don Donki”. Therefore, I will also be using this dataset for subsequent analysis.
2.2.9 Kindergarten
From the themes, I can see that the query parameter for Kindergarten is “kindergartens”. After retrieving the data with a GET request, I will save it as an rds file for easy retrieval.
base_url <- "https://www.onemap.gov.sg/api/public/themesvc/retrieveTheme?queryName="
parameter <- "kindergartens"
# Update query params and header
headers <- httr::add_headers(
Authorization = token
)
url <- paste0(base_url,parameter)
# Make the GET request
response <- httr::GET(url, headers)
# Extract the information
content <- content(response, "text")
kindergarten_df <- as.data.frame(fromJSON(content))
# Save as rds
write_rds(kindergarten_df, "data/processed/kindergarten_df.rds")
Now, I can load the data and check it.
# Load kindergarten_df
kindergarten_df <- read_rds("data/processed/kindergarten_df.rds")
# Check output
head(kindergarten_df)
  SrchResults.FeatCount SrchResults.Theme_Name SrchResults.Category
1 448 Kindergartens Education
2 NA <NA> <NA>
3 NA <NA> <NA>
4 NA <NA> <NA>
5 NA <NA> <NA>
6 NA <NA> <NA>
SrchResults.Owner SrchResults.DateTime
1 EARLY CHILDHOOD DEVELOPMENT AGENCY 2021-12-01T14:34:56+00:00
2 <NA> <NA>
3 <NA> <NA>
4 <NA> <NA>
5 <NA> <NA>
6 <NA> <NA>
SrchResults.Published_Date SrchResults.Formatted_DateTime
1 2012-01-01T00:00:00+00:00 01/12/2021
2 <NA> <NA>
3 <NA> <NA>
4 <NA> <NA>
5 <NA> <NA>
6 <NA> <NA>
SrchResults.Formatted_Published_Date
1 01/01/2012
2 <NA>
3 <NA>
4 <NA>
5 <NA>
6 <NA>
SrchResults.NAME
1 <NA>
2 PCF Sparkletots Preschool @ Cheng San-Seletar Blk 435 (KN)
3 PCF Sparkletots Preschool @ Cheng San-Seletar Blk 533 (KN)
4 PCF Sparkletots Preschool @ Cheng San-Seletar Blk 556 (DS)
5 PCF Sparkletots Preschool @ Chong Pang Blk 107 (KN)
6 PCF Sparkletots Preschool @ Chong Pang Blk 122 (KN)
SrchResults.DESCRIPTION SrchResults.ADDRESSPOSTALCODE
1 <NA> <NA>
2 Kindergartens 560435
3 Kindergartens 560533
4 Kindergartens 560556
5 Kindergartens 760107
6 Kindergartens 760122
SrchResults.ADDRESSSTREETNAME SrchResults.Type
1 <NA> <NA>
2 435 Ang Mo Kio Avenue 10 #01-1393 S(560435) Point
3 533 Ang Mo Kio Avenue 5 #01-4100 S(560533) Point
4 556 Ang Mo Kio Avenue 10 #01-1902 S(560556) Point
5 107 Yishun Ring Road #01-207 S(760107) Point
6 122 Yishun Street 11 #01-479 S(760122) Point
SrchResults.LatLng SrchResults.ICON_NAME
1 <NA> <NA>
2 1.36770367492704,103.854214320416 school.gif
3 1.37416922383175,103.85300192794 school.gif
4 1.37203235928375,103.857625376965 school.gif
5 1.43221807457959,103.827520205691 school.gif
6 1.43499758064097,103.831124799259 school.gif
After looking at the kindergarten dataframe, it is clear that I will need to do some additional processing, similar to eldercare:
- Separate the LatLng column (a single column) into lat and lon
- Select only the columns I need, removing the first row with a filter condition
- Use st_as_sf() from the sf package to convert it into an sf dataframe, then st_transform() to apply the appropriate projection
Once done, I will save this as an rds file.
# Rename colnames
colnames(kindergarten_df) <- gsub("^SrchResults\\.", "", colnames(kindergarten_df))
# Process sf dataframe
kindergarten_sf <- kindergarten_df %>%
separate(LatLng, into = c("lat", "lon"), sep = ",") %>%
select(NAME, Type, ADDRESSPOSTALCODE, ADDRESSSTREETNAME, lat, lon) %>%
filter(!is.na(NAME)) %>%
st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
st_transform(crs=3414)
# Save sf dataframe
write_rds(kindergarten_sf, "data/processed/kindergarten_sf.rds")
Since there is also a kindergarten dataset at data.gov.sg, I will load that in along with the previously saved kindergarten sf dataframe.
# Load data.gov.sg kindergarten data
kindergarten_dg_sf <- st_read("data/geospatial/Kindergartens.kml") %>%
st_transform(crs = 3414)
Reading layer `KINDERGARTENS' from data source
`C:\Users\Henry\Desktop\SMU Masters\2024-2025 T1\Geospatial Analytics & Applications\Project\GeospatialWebsite\Take-Home_Ex\Take-Home_Ex_03\data\geospatial\Kindergartens.kml'
using driver `KML'
Simple feature collection with 448 features and 2 fields
Geometry type: POINT
Dimension: XYZ
Bounding box: xmin: 103.6887 ymin: 1.247759 xmax: 103.9717 ymax: 1.455452
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
# Load Saved OneMap API data
kindergarten_sf <- read_rds("data/processed/kindergarten_sf.rds")
By visualising both datasets side by side with tmap_arrange() from tmap package, I can better see if there are any discrepancies to decide which dataset to use and if any additional processing is required.
# Set tmap mode to plot
tmap_mode("plot")
tmap mode set to plotting
# Calculate row counts
num_kindergarten_sf <- nrow(kindergarten_sf)
num_kindergarten_dg_sf <- nrow(kindergarten_dg_sf)
tmap_options(check.and.fix = TRUE)
# Visualize kindergarten from OneMap API
map_onemap <- tm_shape(mpsz19_shp) +
tm_polygons() +
tm_shape(kindergarten_sf) +
tm_dots(col='red', size = 0.2) +
tm_layout(main.title = paste("Kindergartens (OneMap API) - Count:", num_kindergarten_sf),
main.title.position = ("center"))
# Visualize kindergarten from Data.gov.sg
map_data_gov <- tm_shape(mpsz19_shp) +
tm_polygons() +
tm_shape(kindergarten_dg_sf) +
tm_dots(col='red', size = 0.2) +
tm_layout(main.title = paste("Kindergartens (Data Gov) - Count:", num_kindergarten_dg_sf),
main.title.position = ("center"))
tmap_arrange(map_onemap, map_data_gov, ncol = 2, nrow = 1)
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid

Since the counts for both are the same and there do not seem to be any noticeable discrepancies between them, I can proceed to use either of them in the subsequent steps.
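Matching counts alone do not prove the two layers contain the same points. A stricter cross-check is the distance from each OneMap kindergarten to its nearest data.gov.sg kindergarten, reusing st_distance() and rowMins() from the packages already loaded; a sketch:

```r
# Nearest-neighbour distance (metres) between the two kindergarten layers;
# large values would flag points present in only one source
nn_dist <- st_distance(kindergarten_sf, kindergarten_dg_sf) %>%
  units::drop_units() %>%
  rowMins()
summary(nn_dist)
```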
2.2.10 Childcare Centres
base_url <- "https://www.onemap.gov.sg/api/public/themesvc/retrieveTheme?queryName="
parameter <- "childcare"
# Update query params and header
headers <- httr::add_headers(
Authorization = token
)
url <- paste0(base_url,parameter)
# Make the GET request
response <- httr::GET(url, headers)
# Extract the information
content <- content(response, "text")
childcare_df <- as.data.frame(fromJSON(content))
# Save childcare_df
write_rds(childcare_df, "data/processed/childcare_df.rds")
Now I can load the data and check it.
# Load childcare_df
childcare_df <- read_rds("data/processed/childcare_df.rds")
# Check output
head(childcare_df)
  SrchResults.FeatCount SrchResults.Theme_Name SrchResults.Category
1 1925 Child Care Services Family
2 NA <NA> <NA>
3 NA <NA> <NA>
4 NA <NA> <NA>
5 NA <NA> <NA>
6 NA <NA> <NA>
SrchResults.Owner SrchResults.DateTime
1 EARLY CHILDHOOD DEVELOPMENT AGENCY 2021-12-01T09:39:16+00:00
2 <NA> <NA>
3 <NA> <NA>
4 <NA> <NA>
5 <NA> <NA>
6 <NA> <NA>
SrchResults.Published_Date SrchResults.Formatted_DateTime
1 2012-01-01T00:00:00+00:00 01/12/2021
2 <NA> <NA>
3 <NA> <NA>
4 <NA> <NA>
5 <NA> <NA>
6 <NA> <NA>
SrchResults.Formatted_Published_Date
1 01/01/2012
2 <NA>
3 <NA>
4 <NA>
5 <NA>
6 <NA>
SrchResults.NAME SrchResults.DESCRIPTION
1 <NA> <NA>
2 APOLLO INTERNATIONAL PRESCHOOL PRIVATE LIMITED Child Care Services
3 APPLE TREE PLAYHOUSE PTE LTD Child Care Services
4 Appleland Montessori Child Care Centre Pte Ltd Child Care Services
5 APPLELAND PLAYHOUSE Child Care Services
6 APRICOT ACADEMY (LAGUNA) PTE. LTD. Child Care Services
SrchResults.ADDRESSPOSTALCODE
1 <NA>
2 467903
3 768019
4 650165
5 103104
6 449290
SrchResults.ADDRESSSTREETNAME
1 <NA>
2 44, LIMAU GARDEN, BEDOK PARK, SINGAPORE 467903
3 1, NORTHPOINT DRIVE, #02 - 201, NORTHPOINT CITY, SINGAPORE 768019
4 165, BUKIT BATOK WEST AVENUE 8, #01 - 286, SINGAPORE 650165
5 104C, DEPOT ROAD, #01 - 03, SINGAPORE 103104
6 5000G, MARINE PARADE ROAD, #01 - 28/30, LAGUNA PARK, SINGAPORE 449290
SrchResults.Type SrchResults.LatLng SrchResults.ICON_NAME
1 <NA> <NA> <NA>
2 Point 1.32238448821008,103.95000405168 onemap-fc-childcare.png
3 Point 1.42803584532257,103.836092102185 onemap-fc-childcare.png
4 Point 1.34733679584283,103.741924411432 onemap-fc-childcare.png
5 Point 1.28054679594446,103.811571536842 onemap-fc-childcare.png
6 Point 1.31004098627353,103.931988310202 onemap-fc-childcare.png
After looking at the childcare dataframe, I will need to do some additional processing, similar to eldercare:
- Separate the LatLng column (a single column) into lat and lon
- Select only the columns I need, removing the first row with a filter condition
- Use st_as_sf() from the sf package to convert it into an sf dataframe, then st_transform() to apply the appropriate projection
Once done, I will save this as an rds file.
# Rename colnames
colnames(childcare_df) <- gsub("^SrchResults\\.", "", colnames(childcare_df))
# Process sf dataframe
childcare_sf <- childcare_df %>%
separate(LatLng, into = c("lat", "lon"), sep = ",") %>%
select(NAME, Type, ADDRESSPOSTALCODE, ADDRESSSTREETNAME, lat, lon) %>%
filter(!is.na(NAME)) %>%
st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
st_transform(crs=3414)
# Save sf dataframe
write_rds(childcare_sf, "data/processed/childcare_sf.rds")
Now I will load it back in.
# Load Saved OneMap API data
childcare_sf <- read_rds("data/processed/childcare_sf.rds")
As with other datasets, I will visualize it with tmap to ensure that it is fine to use.
# Set tmap mode to plot
tmap_mode("plot")
tmap mode set to plotting
tmap_options(check.and.fix = TRUE)
# Visualize childcare
tm_shape(mpsz19_shp) +
tm_polygons() +
tm_shape(childcare_sf) +
tm_dots(col='red', size = 0.1) +
tm_layout(main.title = "Childcare (OneMap API)",
main.title.position = ("center"))
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid

Looking at the map, there does not seem to be any anomaly with the points, so I can proceed to use this dataset in the subsequent steps.
2.2.11 Bus Stop
The Bus Stop dataset will be loaded in from the Bus Stop Location dataset (Jul 2024) in LTA Datamall with st_read() and projected with st_transform() from the sf package.
bus_sf <- st_read(dsn = "data/geospatial/BusStopLocation_Jul2024", layer = "BusStop") %>%
st_transform(crs = 3414)
Reading layer `BusStop' from data source
`C:\Users\Henry\Desktop\SMU Masters\2024-2025 T1\Geospatial Analytics & Applications\Project\GeospatialWebsite\Take-Home_Ex\Take-Home_Ex_03\data\geospatial\BusStopLocation_Jul2024'
using driver `ESRI Shapefile'
Simple feature collection with 5166 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 3970.122 ymin: 26482.1 xmax: 48285.52 ymax: 52983.82
Projected CRS: SVY21
As with other datasets, I will visualize it with tmap to ensure that it is fine to use.
# Set tmap mode to plot
tmap_mode("plot")
tmap mode set to plotting
tmap_options(check.and.fix = TRUE)
# Visualize bus stops
tm_shape(mpsz19_shp) +
tm_polygons() +
tm_shape(bus_sf) +
tm_dots(col='red', size = 0.1) +
tm_layout(main.title = "Bus Stops (LTA DataMall)",
main.title.position = ("center"))
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid

Looking at the map, there are some points outside the Singapore boundary. These are likely bus stops in Johor Bahru and should be removed, as only bus stops within Singapore should be considered.
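This removal can be done with apply() over st_within(), or more concisely with st_filter() from the sf package, which keeps only the features of the first layer that satisfy a spatial predicate against the second; a sketch of the latter:

```r
# Keep only bus stops that fall within a Singapore subzone polygon
bus_updated_sf <- st_filter(bus_sf, mpsz19_shp, .predicate = st_within)
```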
# Filter bus stops that are within mpsz19
bus_updated_sf <- bus_sf[
apply(st_within(bus_sf, mpsz19_shp, sparse = FALSE), 1, any),
]
A final check with tmap will be used to confirm that there are no further anomalies.
# Set tmap mode to plot
tmap_mode("plot")
tmap mode set to plotting
tmap_options(check.and.fix = TRUE)
# Visualize bus stops
tm_shape(mpsz19_shp) +
tm_polygons() +
tm_shape(bus_updated_sf) +
tm_dots(col='red', size = 0.1) +
tm_layout(main.title = "Bus Stops Updated (LTA DataMall)",
main.title.position = ("center"))
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid

Now I can save it for use in subsequent steps.
# Save updated bus stop sf dataframe as rds
write_rds(bus_updated_sf, "data/processed/bus_updated_sf.rds")
2.3 Structural Data
While the floor area of the unit has already been provided in the initial dataset, some data wrangling is required to generate the other structural factors.
2.3.1 Floor Level
The floor level information is currently represented in the storey_range attribute. Using unique(), I can check the different values.
# Check storey_range
unique(resale_sf$storey_range)
 [1] "01 TO 03" "04 TO 06" "07 TO 09" "25 TO 27" "10 TO 12" "13 TO 15"
[7] "16 TO 18" "22 TO 24" "19 TO 21" "34 TO 36" "28 TO 30" "37 TO 39"
[13] "31 TO 33" "40 TO 42" "43 TO 45" "46 TO 48" "49 TO 51"
Although I won't be able to get the exact floor level, I can represent it as an ordinal attribute with mutate() and a lengthy case_when() statement. Once done, summary() will be used to check if there are any anomalies.
# Create floor_ord
resale_sf <- resale_sf %>%
mutate(floor_ord = case_when(storey_range == "01 TO 03" ~ 1,
storey_range == "04 TO 06" ~ 2,
storey_range == "07 TO 09" ~ 3,
storey_range == "10 TO 12" ~ 4,
storey_range == "13 TO 15" ~ 5,
storey_range == "16 TO 18" ~ 6,
storey_range == "19 TO 21" ~ 7,
storey_range == "22 TO 24" ~ 8,
storey_range == "25 TO 27" ~ 9,
storey_range == "28 TO 30" ~ 10,
storey_range == "31 TO 33" ~ 11,
storey_range == "34 TO 36" ~ 12,
storey_range == "37 TO 39" ~ 13,
storey_range == "40 TO 42" ~ 14,
storey_range == "43 TO 45" ~ 15,
storey_range == "46 TO 48" ~ 16,
storey_range == "49 TO 51" ~ 17,
TRUE ~ 99))
# Check output
summary(resale_sf$floor_ord)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
1.000 2.000 3.000 3.259 4.000 17.000
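Since every storey_range spans exactly three floors, the lengthy case_when() above can be collapsed into arithmetic on the range's lower bound. A sketch, assuming all values follow the “XX TO YY” pattern (floor_ord2 is an illustrative name, not part of the dataset):

```r
# Compact alternative: "01 TO 03" -> 1, "04 TO 06" -> 2, ..., "49 TO 51" -> 17
resale_sf <- resale_sf %>%
  mutate(floor_ord2 = (as.integer(substr(storey_range, 1, 2)) + 2) %/% 3)

# floor_ord2 should agree with floor_ord on every row
all(resale_sf$floor_ord2 == resale_sf$floor_ord)
```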
2.3.2 Remaining Lease
Since I already have “remaining_lease_yr” and “remaining_lease_mth” from the previous step, I just need to combine them into a single remaining_lease value. This can be done by folding “remaining_lease_yr” into “remaining_lease_mth”. Since the columns may contain NA, I will use coalesce() from the dplyr package to replace them with 0. summary() will be used to do a final check.
# Create remaining_lease_val
resale_sf <- resale_sf %>%
mutate(remaining_lease_val = coalesce(remaining_lease_yr, 0) * 12 + coalesce(remaining_lease_mth, 0))
# Check output
summary(resale_sf$remaining_lease_val)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
497.0 728.0 881.0 885.2 1084.0 1154.0
2.3.3 Age of Unit
Since this is not given, I will calculate it backwards from the remaining lease. This assumes that all HDB flats in this resale dataset started with 99-year leases.
# Create age of unit
resale_sf <- resale_sf %>%
mutate(age_unit = 99 * 12 - remaining_lease_val)
# Check output
summary(resale_sf$age_unit)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
34.0 104.0 307.0 302.8 460.0 691.0
2.3.4 Main Upgrading Program (MUP)
To get this information, I have to manually crawl the HDB website for the Main Upgrading Program (MUP) and the Home Improvement Program (HIP). The HIP is the successor of the MUP, introduced in 2007. This is done for the unique_addresses created in 2.1. The completed data is then saved as hdb_mup_hip.csv, which will be loaded and combined with the resale dataset.
# Load mup_hip dataset
mup_hip <- read_csv("data/aspatial/hdb_mup_hip.csv")
Rows: 8960 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): town, street_name, block, key, mup_announce_date, mup_bill_date, h...
dbl (1): mup_status
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
With reference to Manuel Lehner (2011), I will be exploring creating the following binary attributes:
- MUP_completed - if the HDB has been through the main upgrading program
- HIP_announced - if the HDB has been announced for the home improvement program.
- HIP_completed - if the HDB has been through the home improvement program
To facilitate the join, some slight data wrangling is required. dmy() from the lubridate package is used to convert the values from character to date. Once done, I will join it to the resale data and create the necessary attributes. I will assume resale dates fall on the first day of the “month” value, using ym() from the lubridate package. Since the resale dates are from 2023 onwards, I can simply treat “mup_status” as “mup_completed”. Flats are coded positive for “hip_completed” if the resale date is on or after the hip_bill_date, and positive for “hip_announced” if the resale date is before the hip_bill_date but on or after the hip_announce_date.
# Process
mup_hip_df <- mup_hip %>%
mutate(address = paste(block,street_name),
hip_announce_date = dmy(hip_announce_date),
hip_bill_date = dmy(hip_bill_date)) %>%
select(address, mup_status, hip_announce_date, hip_bill_date)
# Left join mup_hip_df to resale_sf and create status
resale_merge_sf <- left_join(resale_sf, mup_hip_df, by = "address") %>%
mutate(sale_date = ym(month),
hip_completed = case_when(is.na(hip_bill_date) ~ 0,
sale_date >= hip_bill_date ~ 1,
TRUE ~ 0),
hip_announced = case_when(is.na(hip_announce_date) | hip_completed == 1 ~ 0,
sale_date >= hip_announce_date ~ 1,
TRUE ~ 0)
)
# Check output
head(resale_merge_sf)
Simple feature collection with 6 features and 24 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 28346.43 ymin: 38229.07 xmax: 30288.23 ymax: 39555.53
Projected CRS: SVY21 / Singapore TM
# A tibble: 6 × 25
month town flat_type block street_name storey_range floor_area_sqm flat_model
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
1 2023… ANG … 2 ROOM 406 ANG MO KIO… 01 TO 03 44 Improved
2 2023… ANG … 2 ROOM 323 ANG MO KIO… 04 TO 06 49 Improved
3 2023… ANG … 2 ROOM 314 ANG MO KIO… 04 TO 06 44 Improved
4 2023… ANG … 2 ROOM 314 ANG MO KIO… 07 TO 09 44 Improved
5 2023… ANG … 2 ROOM 170 ANG MO KIO… 01 TO 03 45 Improved
6 2023… ANG … 3 ROOM 225 ANG MO KIO… 04 TO 06 67 New Gener…
# ℹ 17 more variables: lease_commence_date <dbl>, remaining_lease <chr>,
# resale_price <dbl>, address <chr>, remaining_lease_yr <int>,
# remaining_lease_mth <int>, postal <chr>, geometry <POINT [m]>,
# floor_ord <dbl>, remaining_lease_val <dbl>, age_unit <dbl>,
# mup_status <dbl>, hip_announce_date <date>, hip_bill_date <date>,
# sale_date <date>, hip_completed <dbl>, hip_announced <dbl>
2.4 Proximity Calculation
Since my seniors have graciously provided a function to calculate proximity (from Megan), I will gratefully leverage it in the subsequent steps.
This proximity function computes a distance matrix with st_distance() from the sf package, then takes each row's minimum with rowMins() from the matrixStats package.
# Proximity function
fn_proximity <- function(df1, df2, varname) {
dist_matrix <- st_distance(df1, df2) %>%
units::drop_units()
df1[, varname] <- rowMins(dist_matrix)
return(df1)
}I will load all the relevant sf dataframes for the various locational factors.
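Before applying fn_proximity to real data, its core — a distance matrix reduced row-wise to its minimum — can be sketched in base R with hypothetical coordinates (no sf objects involved; all values are made up for illustration):

```r
# Toy projected coordinates in metres (hypothetical)
flats <- matrix(c(0, 0,
                  100, 0), ncol = 2, byrow = TRUE)   # 2 flats
stops <- matrix(c(0, 30,
                  100, 40), ncol = 2, byrow = TRUE)  # 2 facilities

# Pairwise Euclidean distance matrix: rows = flats, cols = facilities
dist_matrix <- matrix(0, nrow(flats), nrow(stops))
for (i in seq_len(nrow(flats)))
  for (j in seq_len(nrow(stops)))
    dist_matrix[i, j] <- sqrt(sum((flats[i, ] - stops[j, ])^2))

# Distance to the nearest facility per flat, as rowMins() would return
prox <- apply(dist_matrix, 1, min)
prox  # 30 40
```

With sf objects, st_distance() produces this matrix directly (respecting the projection's units), and rowMins() replaces the apply() call.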
# Load cbd
cbd_sf <- read_rds("data/processed/cbd.rds")
# Load eldercare
eldercare_sf <- read_rds("data/processed/eldercare_sf.rds")
# Load foodcourt/hawker centre
hawker_sf <- st_read("data/geospatial/HawkerCentresKML.kml") %>%
st_transform(crs = 3414)Reading layer `HAWKERCENTRE' from data source
`C:\Users\Henry\Desktop\SMU Masters\2024-2025 T1\Geospatial Analytics & Applications\Project\GeospatialWebsite\Take-Home_Ex\Take-Home_Ex_03\data\geospatial\HawkerCentresKML.kml'
using driver `KML'
Simple feature collection with 125 features and 2 fields
Geometry type: POINT
Dimension: XYZ
Bounding box: xmin: 103.6974 ymin: 1.272716 xmax: 103.9882 ymax: 1.449017
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
# Load mrt
train_sf <- st_read(dsn = "data/geospatial/TrainStationExit", layer = "Train_Station_Exit_Layer") %>%
st_transform(crs = 3414)Reading layer `Train_Station_Exit_Layer' from data source
`C:\Users\Henry\Desktop\SMU Masters\2024-2025 T1\Geospatial Analytics & Applications\Project\GeospatialWebsite\Take-Home_Ex\Take-Home_Ex_03\data\geospatial\TrainStationExit'
using driver `ESRI Shapefile'
Simple feature collection with 593 features and 2 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 6134.086 ymin: 27499.7 xmax: 45356.36 ymax: 47865.92
Projected CRS: SVY21
# Load park
parks_sf <- read_rds("data/processed/parks_updated_sf.rds")
# Load good primary schools
gd_primary_sch_sf <- read_rds("data/processed/gd_primary_sch_sf.rds")
# Load shopping mall
shoppingmall_sf <- read_rds("data/processed/shopping_mall_sf.rds")
# Load supermarket
supermarket_sf <- st_read("data/geospatial/SupermarketsKML.kml") %>%
st_transform(crs = 3414)Reading layer `SUPERMARKETS' from data source
`C:\Users\Henry\Desktop\SMU Masters\2024-2025 T1\Geospatial Analytics & Applications\Project\GeospatialWebsite\Take-Home_Ex\Take-Home_Ex_03\data\geospatial\SupermarketsKML.kml'
using driver `KML'
Simple feature collection with 526 features and 2 fields
Geometry type: POINT
Dimension: XYZ
Bounding box: xmin: 103.6258 ymin: 1.24715 xmax: 104.0036 ymax: 1.461526
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
The proximity function is then run for all the various datasets.
# Compute proximity for all at once
resale_merge_sf <- fn_proximity(resale_merge_sf, cbd_sf, "PROX_CBD") %>%
fn_proximity(., eldercare_sf, "PROX_ELDERCARE") %>%
fn_proximity(., hawker_sf, "PROX_HAWKER") %>%
fn_proximity(., train_sf, "PROX_MRT") %>%
fn_proximity(., parks_sf, "PROX_PARK") %>%
fn_proximity(., gd_primary_sch_sf, "PROX_GDPRISCH") %>%
fn_proximity(., shoppingmall_sf, "PROX_MALL") %>%
fn_proximity(., supermarket_sf, "PROX_SUPERMARKET")2.5 Facility Count
Similar to proximity, my seniors have graciously provided a function (also from Megan), this time to count the number of facilities within a given radius of each flat.
# Count within radius
fn_num_radius <- function(df1, df2, varname, radius) {
dist_matrix <- st_distance(df1, df2) %>%
units::drop_units() %>%
as.data.frame()
df1[,varname] <- rowSums(dist_matrix <= radius)
return(df1)
}I will load all the remaining sf dataframes for the various locational factors.
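As another quick sketch, the counting variant reduces the same kind of distance matrix with rowSums() over a logical comparison instead of a row minimum (toy values, purely illustrative):

```r
# Toy distance matrix in metres: rows = flats, cols = facilities (hypothetical)
dist_matrix <- matrix(c(30, 107.7,
                        104.4, 40), ncol = 2, byrow = TRUE)

# Number of facilities within 50 m of each flat, as in fn_num_radius()
counts <- rowSums(dist_matrix <= 50)
counts  # 1 1

# A wider radius captures more facilities
rowSums(dist_matrix <= 150)  # 2 2
```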
# Load kindergarten
kindergarten_sf <- read_rds("data/processed/kindergarten_sf.rds")
# Load childcare
childcare_sf <- read_rds("data/processed/childcare_sf.rds")
# Load bus stop
bus_sf <- read_rds("data/processed/bus_updated_sf.rds")
# Load primary school
primary_sch_sf <- read_rds("data/processed/primary_sch_sf.rds")# Calculate the count within different radius
resale_merge_sf <- fn_num_radius(resale_merge_sf, kindergarten_sf, "WITHIN_350M_KINDERGARTEN", 350) %>%
fn_num_radius(., childcare_sf, "WITHIN_350M_CHILDCARE", 350) %>%
fn_num_radius(., bus_sf, "WITHIN_350M_BUS", 350) %>%
fn_num_radius(., primary_sch_sf, "WITHIN_1KM_PRISCH", 1000)Now I will save this as an rds file for easy retrieval.
# Save output as rds
write_rds(resale_merge_sf, "data/processed/resale_full.sf")Finally, some final preprocessing is needed for the subsequent analysis:
- Sales between January 2024 and June 2024 will be filtered out
- Only 4-room flats will be retained
- Only the required columns will be selected
Once done, I will save this as an rds file for subsequent retrieval.
# Load full resale data
resale_full_sf <- read_rds("data/processed/resale_full.sf")
# Filter resale data and select required columns
resale_flt_sf <- resale_full_sf %>%
filter(!(sale_date >= as.Date("2024-01-01") & sale_date <= as.Date("2024-06-30"))) %>%
filter(flat_type == "4 ROOM") %>%
select(sale_date, town, flat_type, address, resale_price, floor_area_sqm, floor_ord, remaining_lease_val, age_unit, mup_status,
hip_announced, hip_completed, PROX_CBD, PROX_ELDERCARE, PROX_HAWKER, PROX_MRT, PROX_PARK, PROX_GDPRISCH,
PROX_MALL, PROX_SUPERMARKET, WITHIN_350M_KINDERGARTEN, WITHIN_350M_CHILDCARE, WITHIN_350M_BUS, WITHIN_1KM_PRISCH)
# Save output to rds
write_rds(resale_flt_sf, "data/processed/resale_flt_sf.rds")3 Exploratory Data Analysis
Before proceeding to calibrate a predictive model, it is always good to do some exploratory data analysis to better understand the data. I'll clear the R environment to free up memory and load in the processed data.
# Clear R console
rm(list = ls(all.names = TRUE))
# Load resale data
resale_sf <- read_rds("data/processed/resale_flt_sf.rds")
# Load mpsz data
mpsz19_shp <- st_read(dsn = "data/geospatial/MPSZ-2019/", layer = "MPSZ-2019") %>%
st_transform(crs = 3414)Reading layer `MPSZ-2019' from data source
`C:\Users\Henry\Desktop\SMU Masters\2024-2025 T1\Geospatial Analytics & Applications\Project\GeospatialWebsite\Take-Home_Ex\Take-Home_Ex_03\data\geospatial\MPSZ-2019'
using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS: WGS 84
3.1 Resale Price
First, I’ll look at “resale_price”, since it is the all-important dependent variable for the predictive model. The ggplot2 package will be used to plot a histogram illustrating the distribution of “resale_price”.
# Plot distribution of resale_price
ggplot(data=resale_sf, aes(x=`resale_price`)) +
geom_histogram(bins=20, color="black", fill="light blue") +
ggtitle("Distribution of Resale Prices") +
theme(plot.title = element_text(hjust = 0.5))
Since the distribution is skewed to the right, I can try to reduce the skew with a log transformation. The result will be visualized again with ggplot2 to check the distribution.
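The effect of a log transformation can be sketched with a handful of hypothetical right-skewed prices: the gap between mean and median (scaled by the standard deviation, a simple nonparametric skew measure) shrinks after taking logs.

```r
# Hypothetical right-skewed prices (S$); the single large value drags the mean up
prices <- c(300000, 350000, 400000, 450000, 1200000)
skew <- function(x) (mean(x) - median(x)) / sd(x)

skew(prices)       # positive: right-skewed
skew(log(prices))  # smaller: the long right tail is compressed
```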
# Create new log variable
resale_sf <- resale_sf %>%
mutate(`log_resale_price` = log(resale_price))
# Visualize results
ggplot(data=resale_sf, aes(x=`log_resale_price`)) +
geom_histogram(bins=20, color="black", fill="light blue") +
ggtitle("Distribution of Log Resale Prices") +
theme(plot.title = element_text(hjust = 0.5))
The resale price can also be visualized with an interactive map.
# Turn on interactive mode
tmap_mode("view")tmap mode set to interactive viewing
tmap_options(check.and.fix = TRUE)
# Create interactive point symbol map
tm_shape(mpsz19_shp)+
tm_polygons() +
tm_shape(resale_sf) +
tm_dots(col = "resale_price",
alpha = 0.6,
style="quantile") +
tm_view(set.zoom.limits = c(11,14))Warning: The shape mpsz19_shp is invalid (after reprojection). See
sf::st_is_valid
Looking at the map, it seems that HDB flats tend to be much more expensive around the downtown area. Interestingly, units towards the North-East and East are noticeably more expensive than those towards the North or West.
3.2 Locational Factors
Now, I’ll proceed to look at the distribution of the various locational factors.
# Create 12 histograms, one per locational factor
loc_vars <- c("PROX_CBD", "PROX_ELDERCARE", "PROX_HAWKER", "PROX_MRT",
              "PROX_PARK", "PROX_GDPRISCH", "PROX_MALL", "PROX_SUPERMARKET",
              "WITHIN_350M_KINDERGARTEN", "WITHIN_350M_CHILDCARE",
              "WITHIN_350M_BUS", "WITHIN_1KM_PRISCH")
loc_plots <- lapply(loc_vars, function(v) {
  ggplot(data = resale_sf, aes(x = .data[[v]])) +
    geom_histogram(bins = 20, color = "black", fill = "light blue")
})
# Arrange the 12 histograms into 3 columns by 4 rows
ggarrange(plotlist = loc_plots, ncol = 3, nrow = 4)
Judging from the distributions, all proximity locational factors are right-skewed, with the exception of “PROX_CBD”, which is slightly left-skewed. The count-based locational factors (“WITHIN…”) all seem to have a somewhat normal distribution, with the exception of “WITHIN_350M_KINDERGARTEN”, which is right-skewed. However, since I don't know whether all the attributes come from a normal distribution, I will instead use normalize() from the heatmaply package to standardise the values to a common [0, 1] range. Following which, I'll plot the histograms again to verify this.
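As I understand it, normalize() from heatmaply performs min-max rescaling to the [0, 1] range. A minimal base-R sketch of that idea, under that assumption (minmax is a hypothetical helper name):

```r
# Min-max rescaling to [0, 1] — a base-R sketch of what normalize() does (assumption)
minmax <- function(x) (x - min(x)) / (max(x) - min(x))

minmax(c(2, 4, 6))           # 0.0 0.5 1.0
minmax(c(10, 10.5, 12, 30))  # still spans [0, 1]; relative spacing is preserved
```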
# Normalize variables for proximity factors
resale_sf <- resale_sf %>%
mutate(std_PROX_CBD = normalize(PROX_CBD),
std_PROX_ELDERCARE = normalize(PROX_ELDERCARE),
std_PROX_HAWKER = normalize(PROX_HAWKER),
std_PROX_MRT = normalize(PROX_MRT),
std_PROX_PARK = normalize(PROX_PARK),
std_PROX_GDPRISCH = normalize(PROX_GDPRISCH),
std_PROX_MALL = normalize(PROX_MALL),
std_PROX_SUPERMARKET = normalize(PROX_SUPERMARKET),
std_WITHIN_350M_KINDERGARTEN = normalize(WITHIN_350M_KINDERGARTEN),
std_WITHIN_350M_CHILDCARE = normalize(WITHIN_350M_CHILDCARE),
std_WITHIN_350M_BUS = normalize(WITHIN_350M_BUS),
std_WITHIN_1KM_PRISCH = normalize(WITHIN_1KM_PRISCH))
# Create 12 histograms for the standardised locational factors
std_loc_vars <- c("std_PROX_CBD", "std_PROX_ELDERCARE", "std_PROX_HAWKER", "std_PROX_MRT",
                  "std_PROX_PARK", "std_PROX_GDPRISCH", "std_PROX_MALL", "std_PROX_SUPERMARKET",
                  "std_WITHIN_350M_KINDERGARTEN", "std_WITHIN_350M_CHILDCARE",
                  "std_WITHIN_350M_BUS", "std_WITHIN_1KM_PRISCH")
std_loc_plots <- lapply(std_loc_vars, function(v) {
  ggplot(data = resale_sf, aes(x = .data[[v]])) +
    geom_histogram(bins = 20, color = "black", fill = "light blue")
})
# Arrange the 12 histograms into 3 columns by 4 rows
ggarrange(plotlist = std_loc_plots, ncol = 3, nrow = 4)
3.3 Structural Factors
Now I’ll look at the structural factors.
# Create 4 histograms
UNIT_AREA <- ggplot(data=resale_sf, aes(x= `floor_area_sqm`)) +
geom_histogram(bins=20, color="black", fill="light blue")
UNIT_FLOOR <- ggplot(data=resale_sf, aes(x= `floor_ord`)) +
geom_histogram(bins=20, color="black", fill="light blue")
UNIT_LEASE <- ggplot(data=resale_sf, aes(x= `remaining_lease_val`)) +
geom_histogram(bins=20, color="black", fill="light blue")
UNIT_AGE <- ggplot(data=resale_sf, aes(x= `age_unit`)) +
geom_histogram(bins=20, color="black", fill="light blue")
# Arrange the 4 histograms into 2 columns by 2 rows
ggarrange(UNIT_AREA, UNIT_FLOOR, UNIT_LEASE, UNIT_AGE,
ncol = 2, nrow = 2)
From the charts, I can tell that the unit area follows a somewhat normal distribution. The floor level is understandably right-skewed, as proportionately there are not many high-rise HDB buildings. Remaining lease and unit age seem to follow a roughly uniform distribution. Regardless, I'll apply the same standardisation process to them with normalize() from the heatmaply package.
# Normalize variables for structural factors
resale_sf <- resale_sf %>%
mutate(std_UNIT_AREA = normalize(floor_area_sqm),
std_UNIT_FLOOR = normalize(floor_ord),
std_UNIT_LEASE = normalize(remaining_lease_val),
std_UNIT_AGE = normalize(age_unit))
# Create 4 histograms
UNIT_AREA <- ggplot(data=resale_sf, aes(x= `std_UNIT_AREA`)) +
geom_histogram(bins=20, color="black", fill="light blue")
UNIT_FLOOR <- ggplot(data=resale_sf, aes(x= `std_UNIT_FLOOR`)) +
geom_histogram(bins=20, color="black", fill="light blue")
UNIT_LEASE <- ggplot(data=resale_sf, aes(x= `std_UNIT_LEASE`)) +
geom_histogram(bins=20, color="black", fill="light blue")
UNIT_AGE <- ggplot(data=resale_sf, aes(x= `std_UNIT_AGE`)) +
geom_histogram(bins=20, color="black", fill="light blue")
# Arrange the 4 histograms into 2 columns by 2 rows
ggarrange(UNIT_AREA, UNIT_FLOOR, UNIT_LEASE, UNIT_AGE,
ncol = 2, nrow = 2)
To analyse the MUP/HIP status, I’ll create a categorical variable. prop.table() will then be used to display the proportions.
# Combine MUP/HIP
resale_sf <- resale_sf %>%
mutate(mup_hip_status = case_when(mup_status == 1 ~ 1,
hip_announced == 1 ~ 2,
hip_completed == 1 ~ 3,
TRUE ~ 0))
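Note that case_when() returns the value of the first matching condition, so the ordering above encodes a precedence: MUP, then HIP-announced, then HIP-completed. A small base-R function (a hypothetical helper, for illustration only) makes the same first-match-wins logic explicit:

```r
# First-match-wins, mirroring the case_when() above (illustrative helper)
fn_mup_hip <- function(mup, hip_ann, hip_comp) {
  if (mup == 1) 1
  else if (hip_ann == 1) 2
  else if (hip_comp == 1) 3
  else 0
}

fn_mup_hip(1, 0, 1)  # 1: MUP takes precedence over a completed HIP
fn_mup_hip(0, 0, 1)  # 3
fn_mup_hip(0, 0, 0)  # 0
```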
# Show proportions
prop.table(table(resale_sf$mup_hip_status))
0 1 2 3
0.67354418 0.03257771 0.08022261 0.21365549
From this, I can see that the majority of the HDB flats (67%) have not undergone MUP/HIP. This could prove interesting if the upgrading status does indeed have a significant impact on the resale price.
3.4 Multicollinearity Analysis
Finally, I will do a last check for multicollinearity through a correlation matrix. To do so, I will first remove the geometry data with st_drop_geometry(). ggcorrmat() from the ggstatsplot package will then be used to plot the matrix.
# Drop geometry
resale_nogeo <- resale_sf %>%
st_drop_geometry()
# List of variables
corr_list <- c("std_PROX_CBD", "std_PROX_ELDERCARE", "std_PROX_HAWKER", "std_PROX_MRT",
"std_PROX_PARK", "std_PROX_GDPRISCH", "std_PROX_MALL", "std_PROX_SUPERMARKET",
"std_WITHIN_350M_KINDERGARTEN", "std_WITHIN_350M_CHILDCARE", "std_WITHIN_350M_BUS",
"std_WITHIN_1KM_PRISCH", "std_UNIT_AREA", "std_UNIT_FLOOR", "std_UNIT_LEASE", "std_UNIT_AGE",
"mup_status", "hip_announced", "hip_completed")
# Build correlation matrix
ggcorrmat(resale_nogeo, all_of(corr_list))

Unsurprisingly, the age of the unit is perfectly negatively correlated (-1) with its remaining lease, since age is computed directly from the remaining lease. One of the pair should therefore be dropped before calibrating the model.
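This perfect correlation is easy to reproduce: under a 99-year lease (toy values, hypothetical), age and remaining lease are exact linear complements, so their correlation is exactly -1.

```r
# Hypothetical unit ages under a 99-year lease
age <- c(5, 12, 30, 41, 58)
remaining_lease <- 99 - age

cor(age, remaining_lease)  # -1
```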
4 Multiple Linear Regression
5 Geographically Weighted Linear Regression
6 Conclusion
References
Manuel Lehner (2011). Modelling housing prices in Singapore applying spatial hedonic regression. Retrieved from https://archiv.ivt.ethz.ch/docs/students/sa307.pdf